STAT 679: Problem Set #3

Q1. Glacial Lakes

The data at this link contain labels of glacial lakes in the Hindu Kush Himalaya, created during an ecological survey in 2015 by the International Centre for Integrated Mountain Development.

Part (a)

a.1 How many lakes are in this dataset?

library(sf)  # provides read_sf()
lakes = read_sf("../data/GL_3basins_2015.geojson")
num_lakes = length(unique(lakes$GL_ID))  # count distinct lake IDs

There are 3624 lakes in this data set.

a.2 What are the latitude / longitude coordinates of the largest lakes in each Sub-basin?

library(dplyr)
# Keep the largest lake in each sub-basin, sorted from largest to smallest
largest_lakes = group_by(lakes, Sub_Basin) %>%
  filter(Area == max(Area)) %>%
  arrange(desc(Area)) %>%
  select(GL_ID, Sub_Basin, Area, Latitude, Longitude)
GL_ID            Sub_Basin      Area       Latitude  Longitude  geometry
GL085838E28322N  Sun Koshi      5.4113912  28.32223  85.83813   POLYGON ((85.84835 28.32543…
GL082948E29196N  Bheri          4.9235690  29.19634  82.94822   POLYGON ((82.94605 29.21763…
GL086304E28374N  Arun           4.0192059  28.37403  86.30475   POLYGON ((86.29541 28.35189…
GL083851E28690N  Marsyangdi     3.3841276  28.69089  83.85184   POLYGON ((83.84369 28.70758…
GL086447E27946N  Tama Koshi     1.7197430  27.94679  86.44713   POLYGON ((86.43581 27.93609…
GL086925E27898N  Dudh Koshi     1.3428310  27.89853  86.92510   POLYGON ((86.9127 27.89883,…
GL081780E30128N  Humla          0.7565747  30.12892  81.78076   POLYGON ((81.77613 30.13338…
GL087866E27869N  Tamor          0.6961043  27.86953  87.86618   POLYGON ((87.85986 27.86149…
GL081526E29772N  West Seti      0.5075936  29.77282  81.52618   POLYGON ((81.52834 29.77348…
GL085519E28467N  Trishuli       0.4539165  28.46783  85.51938   POLYGON ((85.51812 28.47, 8…
GL083701E29218N  Kali Gandaki   0.4344556  29.21850  83.70156   POLYGON ((83.7026 29.21713,…
GL082423E29384N  Tila           0.4322915  29.38408  82.42375   POLYGON ((82.42288 29.38101…
GL082414E29753N  Mugu           0.4283468  29.75376  82.41434   POLYGON ((82.41277 29.75636…
GL081577E29897N  Kawari         0.2839747  29.89755  81.57738   POLYGON ((81.57937 29.89932…
GL084628E28596N  Budhi Gandaki  0.2674945  28.59618  84.62826   POLYGON ((84.63338 28.59795…
GL080178E30564N  Kali           0.2396768  30.56462  80.17877   POLYGON ((80.18417 30.56559…
GL081554E29648N  Karnali        0.1262245  29.64815  81.55488   POLYGON ((81.55656 29.64652…
GL084116E28446N  Seti           0.1024868  28.44681  84.11694   POLYGON ((84.11637 28.44483…
GL086542E27713N  Likhu          0.0903048  27.71321  86.54286   POLYGON ((86.54486 27.71284…
GL085717E28042N  Indrawati      0.0269401  28.04209  85.71757   POLYGON ((85.71581 28.04205…

Part (b)

Plot the polygons associated with each of the lakes identified in part (a).

Hint: You may find it useful to split lakes across panels using the tm_facets function. If you use this approach, make sure to include a scale with tm_scale_bar(), so that it is clear that lakes have different sizes.

library(tmap)
# One panel per sub-basin, with a scale bar for size comparison
tm_shape(largest_lakes %>% select(-GL_ID)) +
  tm_polygons(col = "Area", palette = "Blues", legend.show = FALSE) +
  tm_facets(by = "Sub_Basin", free.scales = FALSE, ncol = 5, scale.factor = 4) +
  tm_scale_bar()

Part (c)

Visualize all lakes with latitude between 28.2 and 28.4 and with longitude between 85.8 and 86. Optionally, add a basemap associated with each lake.

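# Keep lakes whose centroid coordinates fall inside the requested box
lakes_subset = lakes %>%
  filter((Latitude >= 28.2 & Latitude <= 28.4) & (Longitude >= 85.8 & Longitude <= 86.0))

# Satellite basemap centered on the box (15 km buffer)
basemap = cc_location(loc = c(85.9, 28.3), buffer = 15e3)

# page_bg_color is not defined in the code shown here; it is assumed to be set earlier
tm_shape(basemap) +
  tm_rgb(alpha = 0.9) +
  tm_shape(lakes_subset) +
  tm_polygons(col = "deepskyblue3") +
  tm_layout(bg.color = page_bg_color, inner.margins = c(0, 0), frame = FALSE, saturation = -1) +
  tm_view(set.bounds = c(85.8, 28.2, 86.0, 28.4), set.zoom.limits = c(11, 14))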

Q2. Australian Pharmaceuticals II

The PBS data set contains the number of orders filed every month for different classes of pharmaceutical drugs, as tracked by the Australian Pharmaceutical Benefits Scheme. The code below takes the full PBS data set and filters down to the 10 most commonly prescribed pharmaceutical types. This problem will ask you to implement and compare two approaches to visualizing this data set.

pbs_full <- read_csv("../data/PBS_random.csv") %>% 
  mutate(Month = as.Date(Month))

top_atcs <- pbs_full %>%
  group_by(ATC2_desc) %>%
  summarise(total = sum(Scripts)) %>%
  slice_max(total, n = 10) %>%
  pull(ATC2_desc)

pbs <- pbs_full %>%
  filter(ATC2_desc %in% top_atcs, Month > "2007-01-01")

Part (a)

Implement a stacked area visualization of these data.

ggplot(pbs) + 
  geom_area(aes(Month,Scripts/1e6, col=ATC2_desc, fill = ATC2_desc), alpha=0.6) +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), n.breaks=11) + 
  scale_fill_brewer(palette = "Paired") + 
  scale_colour_brewer(palette = "Paired") + 
  labs(
    title = "Monthly orders for the 10 most commonly prescribed drug classes, from 2007 onward",
    y = "Orders filed (in millions)", fill = "CLASS OF DRUG", col = "CLASS OF DRUG"
  )

Part (b)

Implement an alluvial visualization of these data.

ggplot(pbs) +
  geom_alluvium(aes(Month, Scripts/1e6, fill = ATC2_desc, col = ATC2_desc, alluvium = ATC2_desc), decreasing = FALSE, alpha = 0.6) +
  scale_x_date(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), n.breaks=11) + 
  scale_fill_brewer(palette = "Paired") + 
  scale_colour_brewer(palette = "Paired") + 
  labs(
    title = "Monthly orders for the 10 most commonly prescribed drug classes, from 2007 onward",
    y = "Orders filed (in millions)", fill = "CLASS OF DRUG", col = "CLASS OF DRUG"
  )

Part (c)

Compare and contrast the strengths and weaknesses of these two visualization strategies. Which user queries are easier to answer using one approach vs. the other?

Stacked area
  Strengths: Because the areas are stacked on top of each other, it is easy to see how the total at any time point breaks down by drug class. User queries about which drug classes account for the largest share of sales at a time point, or about the overall trend, are easy to answer.
  Weaknesses: It is not easy to rank the drug classes by their relative sales.

Alluvial
  Strengths: The ranking is easy to identify, because at every time point the streams are ordered by their contribution to sales. User queries about which drug class has the highest or lowest sales, or about how sales change between time points, are easy to answer.
  Weaknesses: It is not easy to gauge total sales at a glance, since the view works more as a comparison visualization.
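
The two kinds of query can also be stated numerically. The following is a minimal base-R sketch on a made-up toy table (the `toy` data frame, its column names, and its values are hypothetical and not taken from the PBS data). It computes the per-month ranking that the alluvial view makes visible and the per-month totals that the stacked area view makes visible.

```r
# Hypothetical toy data: three drug classes over two months (values invented)
toy <- data.frame(
  month   = rep(c("2007-02", "2007-03"), each = 3),
  class   = rep(c("A", "B", "C"), times = 2),
  Scripts = c(30, 10, 20, 5, 25, 15),
  stringsAsFactors = FALSE
)

# Rank classes within each month (1 = most scripts); the alluvial view
# shows this ordering through the vertical position of each stream
toy$rank <- ave(-toy$Scripts, toy$month, FUN = rank)

# Monthly totals; the stacked area view shows these as the top edge of the stack
totals <- tapply(toy$Scripts, toy$month, sum)

print(toy)
print(totals)
```

In this toy example class A drops from first to last between the two months, a change that is read off the ranks directly but has to be inferred from band thickness in a stacked area chart.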

Q3. Spotify Time Series II

The code below provides the number of Spotify streams for the 40 tracks with the highest stream count in the Spotify 100 dataset for 2017. This problem will ask you to explore a few different strategies that can be used to visualize this time series collection.

spotify_full <- read_csv("../data/spotify.csv")

top_songs <- spotify_full %>% 
  group_by(track_name) %>%
  summarise(total = sum(streams)) %>%
  slice_max(total, n = 40) %>%
  pull(track_name)

top_10_songs <- spotify_full %>% 
  group_by(track_name) %>%
  summarise(total = sum(streams)) %>%
  slice_max(total, n = 10) %>%
  pull(track_name)

spotify <- spotify_full %>% filter(track_name %in% top_songs)

# Data Preprocessing to cut down a long song name
spotify$track_name[spotify$track_name==unique(spotify$track_name)[14]] <- "I Don’t Wanna Live Forever"
spotify$track_name = trimws(str_replace(spotify$track_name, " \\(([^)]+)\\)", ""))

Part (a)

Design and implement a line-based visualization of these data.

# TODO: Save plots as images and load them in the Knit output
ggplot(spotify, aes(x=date)) + 
  geom_line(aes(date, streams/1e6, group=track_name),col="forestgreen", alpha=0.5,size=0.6) +
  scale_x_date(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0), breaks = seq(0,12,1)) + 
  labs(title="Spotify's 40 top streamed songs in 2017", x="",y="Streams (in millions)") +
  theme(axis.ticks.x = element_line(), axis.text.x=element_text(size=7, hjust=0.7), panel.grid.minor = element_blank())

Part (b)

Design and implement a horizon plot visualization of these data.

custom_greys_palette = colorRampPalette(c("black","#EAEAEA"), space = "Lab")
custom_greens_palette = colorRampPalette(c("#BAE4B3","forestgreen"), space = "Lab")
spotify_palette = c((custom_greys_palette(8)),custom_greens_palette(4))

# TODO: Save plots as images and load them in the Knit output
ggplot(spotify, aes(date,streams/1e6)) + 
  geom_horizon(aes(fill=..Cutpoints..), origin = 4, horizonscale = seq(0,12,1), alpha=0.8) + 
  facet_wrap(~ reorder(track_name, -streams), ncol=1, strip.position = 'left') +
  scale_x_date(expand = c(0,0)) + 
  scale_y_continuous(expand = c(0,0), breaks = seq(0,12,3)) + 
  scale_fill_manual(values = spotify_palette) + 
  labs(title="Spotify's top streamed songs in 2017", x="",y="",fill="Streams (in millions)") + 
  guides(fill = guide_legend(nrow=1, reverse=T)) +
  theme(
    axis.ticks.x = element_line(), 
    axis.text.x=element_text(hjust=0.9),
    axis.text.y=element_blank(), 
    strip.text.y.left = element_text(angle=0, hjust=0.5), 
    legend.position = "top")

Part (c)

Building from the static views from (a - b), propose, but do not implement, a visualization of this data set that makes use of dynamic queries. What would be the structure of interaction, and how would the display update when the user provides a cue? Explain how you would use one type of D3 selection or mouse event to implement this hypothetical interactivity.

Response:

I have two ideas for dynamic queries for these temporal visualizations.

  1. Adding a brush interaction inside a facet. When the user brushes over a region inside one facet, that time period is highlighted (and probably zoomed in too) for all the song facets. This would give a fine-grained view of the song streams over a short period of time. I would also add a “Reset Zoom” action button to reset to the original view. A non-graphical alternative to this query is to replace the brush with a date range slider.
  2. A multi-select tool to filter artists or tracks from the data set. Facets will appear or disappear based on the filter chosen.

The above interactive queries can also be combined to produce a highly responsive, information-dense visualization.
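
The data side of the brush idea can be sketched independently of the front end. In a D3 implementation the brushed pixel range would presumably be converted to dates by inverting the x scale inside a brush event handler; the base-R sketch below only illustrates the filtering step that would follow. The function name `apply_brush` and the `toy` data are hypothetical.

```r
# Hypothetical toy data: two tracks over four days (values invented)
toy <- data.frame(
  date       = rep(as.Date("2017-01-01") + 0:3, times = 2),
  track_name = rep(c("Track A", "Track B"), each = 4),
  streams    = c(5, 6, 7, 8, 1, 2, 3, 4)
)

# Keep only the rows inside the brushed date range, for every track at once,
# so that all facets zoom to the same time window
apply_brush <- function(data, range) {
  data[data$date >= range[1] & data$date <= range[2], ]
}

brushed <- apply_brush(toy, as.Date(c("2017-01-02", "2017-01-03")))
print(brushed)

# "Reset Zoom" would simply redraw from the full data
reset <- toy
```

The same function would serve the date range slider alternative, since both controls reduce to a pair of dates.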

Q4. CalFresh Enrollment II

In this problem, we will develop an interactively linked spatial and temporal visualization of enrollment data from CalFresh, a nutritional assistance program in California. We will use D3 in this problem.

Part (a)

Using a line generator, create a static line plot that visualizes change in enrollment for every county in California.

Part (b)

On the same page as part (a), create a choropleth map of California, shaded in by the average enrollment for that county over the full time window of the data set.

Part (c)

Propose one interactive, graphical query that links the combined spatial + temporal view from (a - b). For example, you may consider highlighting time series when a county is hovered, highlighting counties when specific series are brushed, or updating choropleth colors when a subsetted time range is selected.

When I hover over a county on the map, the line plot should show only that county's time series.

Part (d)

Implement your proposal from part (c).

Q5. Temporal and Geospatial Commands

For each of the commands below, explain what role it serves in temporal or spatial data visualization. Describe a specific (if hypothetical) situation within which you could imagine using the function.

a. geom_stream

geom_stream is used to make stream graphs for temporal data. Stream graphs are like stacked area charts, but oriented around a central line rather than a base line. They are also almost always smoothed slightly, so that the transitions are not too jarring. I would use it when I need to track how far a quantity deviates from a predetermined mean. For example, I could plot the time it takes me to commute to class every day (the mean time is 15 minutes). Some days I might take more time, and other days less, and I could visualize these differences in the plot.
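
The difference between a base line and a central line can be made concrete with a little arithmetic. This is a minimal base-R sketch of the centering idea only, on invented values; it omits the smoothing mentioned above and is not meant to reproduce how geom_stream is implemented.

```r
# Hypothetical values: rows are time points, columns are series (invented)
y <- matrix(c(2, 4, 6,
              1, 3, 2),
            nrow = 3, dimnames = list(NULL, c("s1", "s2")))

total <- rowSums(y)

# Stacked area: the first layer starts at 0 at every time point
base_stacked <- rep(0, nrow(y))

# Stream graph: shift the baseline down by half the total,
# so that the whole stack is centered on 0
base_stream <- -total / 2

# Upper and lower edge of each layer in the centered layout
upper <- base_stream + t(apply(y, 1, cumsum))
lower <- upper - y

print(lower)
print(upper)
```

The bottom edge of the first layer mirrors the top edge of the last one, which is what gives a stream graph its symmetric outline.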

b. tm_lines

tm_lines is used to plot line-type vector data on a map. It is best suited to plotting the roads of a city, or the path taken to travel from Point A to Point B on a map.

c. rast

rast() is used to read raster-type geospatial data. This type of data gives a measurement along a spatial grid, and the metadata records which location on the earth each entry of the matrix is associated with. One example usage would be to visualize the temperature across a geographical region at a point in time.
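
To illustrate how grid entries and location metadata fit together, here is a minimal base-R sketch. It does not use the raster-reading package itself; the matrix `vals`, the extent `ext`, and the helper `cell_center` are all hypothetical and stand in for what a raster object would carry.

```r
# Hypothetical 2 x 3 grid of temperatures (values invented)
vals <- matrix(c(10, 11, 12,
                 13, 14, 15), nrow = 2, byrow = TRUE)

# Hypothetical extent metadata: the grid covers longitude 85-88, latitude 27-29
ext <- c(xmin = 85, xmax = 88, ymin = 27, ymax = 29)

# Map a (row, col) index to the longitude / latitude of the cell center
cell_center <- function(row, col, vals, ext) {
  res_x <- (ext[["xmax"]] - ext[["xmin"]]) / ncol(vals)
  res_y <- (ext[["ymax"]] - ext[["ymin"]]) / nrow(vals)
  c(lon = ext[["xmin"]] + (col - 0.5) * res_x,
    lat = ext[["ymax"]] - (row - 0.5) * res_y)  # row 1 is the northern edge
}

print(cell_center(1, 1, vals, ext))
print(cell_center(2, 3, vals, ext))
```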

d. geom_horizon

geom_horizon is used to draw horizon plots for temporal data. The horizon plot is a reasonable compromise between faceting and heat maps: it preserves some of the advantages of using y-position to encode value while keeping some of the compactness of heat maps. I would use horizon plots on data where I am only concerned about where the peaks lie, as that is the information this kind of plot conveys at a glance.
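
The compactness comes from slicing the series into bands and overlaying them. The following is a simplified base-R sketch of that folding step for values above the origin only, using invented numbers and equal-height bands; the names `band_height`, `origin`, `h`, and `n_bands` are hypothetical, and this is not the package's implementation.

```r
# Hypothetical series (invented values, e.g. streams in millions)
y <- c(1, 3.5, 6, 2)
origin <- 0    # values are measured relative to this level
h <- 2         # height of each band
n_bands <- 3

# Height of the series inside band k, clipped to the interval [0, h]
band_height <- function(y, k) pmin(pmax(y - origin - (k - 1) * h, 0), h)

# One column per band; in the plot the columns are drawn on top of each other,
# with higher bands in darker colors
bands <- sapply(seq_len(n_bands), function(k) band_height(y, k))

print(bands)
```

Each row sums back to the original value, so no information is lost; the peak at 6 is the only point that reaches the third, darkest band, which is why peaks stand out at a glance.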

Q6. Visualization for Everyday Decisions

We routinely encounter spatial and temporal data in everyday life, from the dates on a concert calendar to the layout of buttons in an elevator. This problem asks you to critically reflect on the way these data are represented.

a. Describe one type of spatial or temporal data set (loosely defined) that you encounter in a nonacademic context. What visual encodings are used? Does it implement any interactivity?

I use the weather app on my phone every morning. The app shows a single-line plot of the temperature of the day every hour. It also shows a small icon next to every (hourly) point on the line to show if the sky is sunny, clear, cloudy, rainy, or snowy at the time-point, and another small water drop at the bottom showing the chances of precipitation in percentages. When I click on one time-point, the app shows a detailed description of the weather at the time.

b. What questions does the visualization help you answer? How easily can you arrive at an accurate answer?

I plan my outfits according to this visualization every day. If it is cold and snowing, I layer up and wear warm clothes. It is very easy to arrive at an accurate answer, as there is only a single line for the current day in this visualization, and only one point on the line for every hour.

c. In the spirit of the re-imagined parking sign project, propose an alternative representation of these data (or an alternative way of interacting with the same representation). Why did you propose the design that you did? What advantages do you think it has?

I had a fun thought on how this line plot could be shown differently. I would put up a circular clock and replace every hour mark with the icon used to display the sky condition. I would fill the circular clock 30% with water if the chance of precipitation at that time is 30%. The hour hand would point towards the current hour, and two small rectangular boxes inside the circle would show the numeric values of the temperature and the UV index. I would swipe over the clock to see the same information for the next 12 hours.

One advantage it has is that I can see the weather trend of the day in a single glance. In the original line plot, I have to keep scrolling to the right to see the evening weather, given the limited screen-space of my mobile.